- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- Europe > France > Grand Est > Meurthe-et-Moselle > Nancy (0.04)
- (2 more...)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (0.67)
A Experiment Details for Reproducibility
For all the other datasets, we follow their original train/dev/test splits. We fine-tune a pre-trained language model (e.g., BERT-Base) on the source training set to generate the source model. The source test set is used for evaluating the "source F1". Statistics of each dataset pair are included in Table 9. Batch size is set to 32 in all experiments for all methods. We conduct a grid search on learning rate and regularization strength for each experiment using the target dev set. We then train the model with this hyper-parameter configuration using two additional random seeds and report the mean and standard deviation.
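The tuning protocol above (grid search on the target dev set, then re-training with two extra seeds) can be sketched as follows. The grid values, the `train_and_eval` helper, and its toy scoring formula are illustrative assumptions, not the paper's actual code:

```python
import itertools
import statistics

LEARNING_RATES = [1e-5, 2e-5, 5e-5]   # assumed grid values
REG_STRENGTHS = [0.0, 0.01, 0.1]      # assumed grid values

def train_and_eval(lr, reg, seed):
    """Stand-in for fine-tuning BERT-Base and scoring F1 on the target dev set."""
    # Deterministic toy score so the sketch runs end-to-end.
    return 0.80 + 0.01 * (lr * 1e5) - 0.05 * reg + 0.001 * seed

# 1) Grid search on the target dev set (single seed).
best_lr, best_reg = max(
    itertools.product(LEARNING_RATES, REG_STRENGTHS),
    key=lambda cfg: train_and_eval(cfg[0], cfg[1], seed=0),
)

# 2) Re-train with two additional random seeds; report mean and standard deviation.
scores = [train_and_eval(best_lr, best_reg, seed=s) for s in (0, 1, 2)]
print(f"F1 = {statistics.mean(scores):.3f} +/- {statistics.stdev(scores):.3f}")
```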
M3OOD: Automatic Selection of Multimodal OOD Detectors
Qin, Yuehan, Li, Li, Cao, Defu, Yang, Tiankai, Zhao, Yue
Out-of-distribution (OOD) robustness is a critical challenge for modern machine learning systems, particularly as they increasingly operate in multimodal settings involving inputs like video, audio, and sensor data. Many OOD detection methods have been proposed, each with different designs targeting various distribution shifts. A single OOD detector may not prevail across all scenarios; therefore, how can we automatically select an ideal OOD detection model for different distribution shifts? Due to the inherently unsupervised nature of the OOD detection task, it is difficult to predict model performance and find a universally best model. Systematically comparing models on new, unseen data is also costly or even impractical. To address this challenge, we introduce M3OOD, a meta-learning-based framework for OOD detector selection in multimodal settings. Meta-learning offers a solution by learning from historical model behaviors, enabling rapid adaptation to new data distribution shifts with minimal supervision. Our approach combines multimodal embeddings with handcrafted meta-features that capture distributional and cross-modal characteristics to represent datasets. By leveraging historical performance across diverse multimodal benchmarks, M3OOD can recommend a suitable detector for a new data distribution shift. Experimental evaluation demonstrates that M3OOD consistently outperforms 10 competitive baselines across 12 test scenarios with minimal computational overhead.
- Research Report > New Finding (0.46)
- Research Report > Experimental Study (0.46)
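The selection mechanism the M3OOD abstract describes can be sketched as a minimal meta-learner over historical detector performance. The meta-feature values, detector names, and the 1-nearest-neighbor rule below are illustrative assumptions standing in for the paper's actual meta-features and regressor:

```python
import math

# Hypothetical meta-features (stand-ins for the distributional and cross-modal
# statistics the abstract mentions) for three historical multimodal datasets,
# paired with the AUROC each candidate OOD detector achieved on them.
hist = [
    ([0.2, 0.9, 0.1], {"MSP": 0.91, "Energy": 0.70, "Mahalanobis": 0.65}),
    ([0.8, 0.1, 0.5], {"MSP": 0.60, "Energy": 0.88, "Mahalanobis": 0.72}),
    ([0.4, 0.4, 0.7], {"MSP": 0.55, "Energy": 0.66, "Mahalanobis": 0.90}),
]

def recommend(meta):
    """1-NN meta-learner: reuse the best detector of the closest known dataset."""
    _, perf = min(hist, key=lambda rec: math.dist(rec[0], meta))
    return max(perf, key=perf.get)

print(recommend([0.25, 0.85, 0.15]))  # MSP (nearest to the first dataset)
```

No labels from the new dataset are needed at selection time; only its meta-features and the historical performance table are used, which is the point of the meta-learning formulation.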
MetaOOD: Automatic Selection of OOD Detection Models
Qin, Yuehan, Zhang, Yichi, Nian, Yi, Ding, Xueying, Zhao, Yue
How can we automatically select an out-of-distribution (OOD) detection model for various underlying tasks? This is crucial for maintaining the reliability of open-world applications by identifying data distribution shifts, particularly in critical domains such as online transactions, autonomous driving, and real-time patient diagnosis. Despite the availability of numerous OOD detection methods, the challenge of selecting an optimal model for diverse tasks remains largely underexplored, especially in scenarios lacking ground truth labels. In this work, we introduce MetaOOD, the first zero-shot, unsupervised framework that utilizes meta-learning to automatically select an OOD detection model. As a meta-learning approach, MetaOOD leverages historical performance data of existing methods across various benchmark OOD datasets, enabling the effective selection of a suitable model for new datasets without the need for labeled data at test time. To quantify task similarities more accurately, we introduce language model-based embeddings that capture the distinctive OOD characteristics of both datasets and detection models. Through extensive experimentation with 24 unique test dataset pairs and 11 candidate OOD detection models, we demonstrate that MetaOOD significantly outperforms existing methods while bringing only marginal time overhead. Our results, validated by Wilcoxon statistical tests, show that MetaOOD surpasses a diverse group of 11 baselines, including established OOD detectors and advanced unsupervised selection methods.
- North America > United States > California (0.14)
- North America > United States > New York > New York County > New York City (0.04)
- North America > United States > Pennsylvania > Allegheny County > Pittsburgh (0.04)
- North America > United States > Illinois > Cook County > Chicago (0.04)
- Information Technology (0.48)
- Transportation (0.34)
- Information Technology > Artificial Intelligence > Natural Language > Large Language Model (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Neural Networks > Deep Learning (1.00)
- Information Technology > Artificial Intelligence > Machine Learning > Statistical Learning (0.98)
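The zero-shot selection idea in the MetaOOD abstract, where task similarity is measured between embeddings, can be illustrated with a toy similarity-weighted score. The vectors, detector names, and weighting scheme below are illustrative assumptions; in the paper the embeddings come from a language model applied to dataset and model descriptions:

```python
import math

# Stand-in "language model embeddings" for two historical dataset pairs,
# each paired with the AUROC of two hypothetical candidate detectors.
train = [
    ([1.0, 0.0, 0.2], [0.9, 0.6]),   # (embedding, per-detector AUROC)
    ([0.1, 1.0, 0.0], [0.5, 0.8]),
]
detectors = ["detector_A", "detector_B"]  # hypothetical names

def cosine(a, b):
    dot = sum(x * y for x, y in zip(a, b))
    return dot / (math.hypot(*a) * math.hypot(*b))

def select(test_emb):
    """Similarity-weighted expected performance; pick the argmax detector."""
    scores = [0.0] * len(detectors)
    for emb, perf in train:
        w = max(cosine(test_emb, emb), 0.0)   # ignore dissimilar tasks
        for i, p in enumerate(perf):
            scores[i] += w * p
    return detectors[scores.index(max(scores))]

print(select([0.9, 0.1, 0.1]))  # detector_A (closest to the first task)
```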
SimClone: Detecting Tabular Data Clones using Value Similarity
Yang, Xu, Rajbahadur, Gopi Krishnan, Lin, Dayi, Wang, Shaowei, Jiang, Zhen Ming
Data clones are defined as multiple copies of the same data among datasets. The presence of data clones between datasets can cause issues such as difficulty in managing data assets and data-license violations when datasets with clones are used to build AI software. However, detecting data clones is not trivial. The majority of prior studies in this area rely on structural information (e.g., font size, column headers) to detect data clones. However, tabular datasets used to build AI software are typically stored without any structural information. In this paper, we propose a novel method called SimClone for data clone detection in tabular datasets without relying on structural information. SimClone utilizes value similarities for data clone detection. We also propose a visualization approach as part of SimClone to help locate the exact position of the cloned data between a dataset pair. Our results show that SimClone outperforms the current state-of-the-art method by at least 20% in terms of both F1-score and AUC. In addition, SimClone's visualization component helps identify the exact location of the data clone in a dataset, with a Precision@10 of 0.80 among the top 20 true-positive predictions.
- North America > United States > Washington > King County > Seattle (0.14)
- North America > Canada > Manitoba (0.04)
- North America > United States > New York > New York County > New York City (0.04)
- (4 more...)
- Health & Medicine (0.93)
- Information Technology > Security & Privacy (0.46)
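The value-similarity idea behind SimClone can be illustrated with a toy clone check between two tabular datasets. The Jaccard measure over column value sets, the exhaustive column pairing, and the 0.5 threshold are simplifying assumptions for illustration, not SimClone's actual similarity computation:

```python
def column_similarity(a, b):
    """Jaccard similarity between the value sets of two columns."""
    sa, sb = set(a), set(b)
    return len(sa & sb) / len(sa | sb) if sa | sb else 0.0

# Two toy tabular datasets stored as column -> values mappings; no headers,
# fonts, or other structural information are used, mirroring the
# structure-free setting the abstract describes.
ds1 = {"price": [10, 20, 30, 40], "qty": [1, 2, 3, 4]}
ds2 = {"cost": [10, 20, 30, 99], "n": [7, 8, 9, 10]}

# Flag cross-dataset column pairs above an illustrative 0.5 threshold
# as clone candidates.
clones = [
    (c1, c2)
    for c1, v1 in ds1.items()
    for c2, v2 in ds2.items()
    if column_similarity(v1, v2) >= 0.5
]
print(clones)  # [('price', 'cost')]
```

Note that "price" and "cost" are flagged despite having different column names, which a purely structural comparison would miss.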
Convolutional Monge Mapping Normalization for learning on sleep data
Gnassounou, Théo, Flamary, Rémi, Gramfort, Alexandre
In many machine learning applications on signals and biomedical data, especially electroencephalogram (EEG), one major challenge is the variability of the data across subjects, sessions, and hardware devices. In this work, we propose a new method called Convolutional Monge Mapping Normalization (CMMN), which filters the signals to adapt their power spectral density (PSD) to a Wasserstein barycenter estimated on training data. CMMN relies on novel closed-form solutions for optimal transport mappings and barycenters and provides individual test-time adaptation to new data without needing to retrain a prediction model. Numerical experiments on sleep EEG data show that CMMN leads to significant and consistent performance gains independent of the neural network architecture when adapting between subjects, sessions, and even datasets collected with different hardware. Notably, our performance gain is on par with much more numerically intensive Domain Adaptation (DA) methods and can be used in conjunction with them for even better performance.
- North America > Canada > Quebec > Montreal (0.04)
- North America > United States > Illinois > Cook County > Westchester (0.04)
- North America > United States > District of Columbia > Washington (0.04)
- (3 more...)
- Health & Medicine > Therapeutic Area > Neurology (1.00)
- Health & Medicine > Health Care Technology (0.67)
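The PSD-adaptation step described in the CMMN abstract can be sketched in the frequency domain: rescale each frequency bin so the signal's PSD matches a target (barycenter) PSD. The flat target PSD and the zero-phase per-bin rescaling below are illustrative simplifications, not the paper's closed-form Monge mapping or barycenter estimation:

```python
import numpy as np

rng = np.random.default_rng(0)
x = rng.normal(size=1024)                      # toy EEG-like signal

X = np.fft.rfft(x)
source_psd = np.abs(X) ** 2 / len(x)           # crude periodogram PSD estimate
target_psd = np.ones_like(source_psd)          # flat barycenter, for illustration

# Zero-phase mapping: scale each bin's magnitude by sqrt(target / source),
# which transports the source PSD onto the target PSD.
H = np.sqrt(target_psd / np.maximum(source_psd, 1e-12))
x_adapted = np.fft.irfft(H * X, n=len(x))

adapted_psd = np.abs(np.fft.rfft(x_adapted)) ** 2 / len(x)
print(np.allclose(adapted_psd, target_psd))    # True
```

Because the mapping acts only on the signal (a test-time filter), the downstream prediction model needs no retraining, which matches the test-time-adaptation property the abstract highlights.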